Abstract

Reinforcement learning (RL) algorithms can learn optimal policies
for control problems by exploring a domain’s state space.
Unfortunately, for most problems the state space is too large for
RL techniques to explore fully enough to find good policies. State
abstraction is one way of reducing the size and complexity of a
domain’s state space in order to enable RL. In this paper we
introduce a new approach for automatically deriving state
abstractions, called Evolutionary Tile Coding, that uses a genetic
algorithm to derive effective tile codings. We provide an empirical
analysis of the new algorithm, comparing it to another adaptive
tile coding method as well as to fixed tile coding. Our results
show that our approach is able to automatically derive effective
state abstractions for two RL benchmark problems. Additionally, we
present an intriguing result showing that the state space of the
classical mountain car problem (Justin Boyan 1995) can be reduced
to just two states while still preserving the discovery of an
optimal policy.

Introduction 

Technological development has been driving toward more complex
systems that require faster responses to events. Processes that
used to require human intervention quickly grow beyond the human
operator’s ability to respond. Autonomous control techniques can
respond as fast as needed to be effective, but the increasing
complexity of the tasks makes the autonomous controller difficult
to design. Reinforcement learning (RL) (Sutton and Barto 1998) is
one approach that can be used to automate control processes and
provide rich solutions that respond robustly to new situations.

Reinforcement learning algorithms attempt to discover an optimal
policy for a given domain by exploring its state space. The optimal
policy, π*, is a prescriptive function that determines the action
to take in any given state so as to obtain the maximum aggregate
reward. RL algorithms learn π* on-line by trying various actions in
the states they experience and observing the rewards they receive.
An RL algorithm must experience the consequences, r, of attempting
a particular action, a, in a given state, s, a number of times
before it can accurately estimate the value of taking a in s,
otherwise known as the Q-value, Q(s, a).

A significant issue that has hindered the application of RL
algorithms is the state space problem, also known as the curse of
dimensionality (Bellman 1961). Simply put, for most domains there
are too many states in a problem’s state space to experience and
learn over. The state of a problem is defined by the features of
the domain and by the number of values each feature can take. Every
unique combination of feature values, if interpreted literally,
represents a unique state, so adding a new feature to the
representation multiplies the number of states and the state space
grows exponentially with the number of features. A means of
reducing the size and complexity of state spaces is therefore
necessary for RL.
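To make the growth concrete, the following minimal sketch counts the states of a hypothetical domain in which every feature is discretized into the same number of values. The function name and the choice of 10 values per feature are illustrative assumptions, not taken from the experiments in this paper.

```python
def num_states(values_per_feature):
    """Count the states when every combination of feature values is a state."""
    total = 1
    for n in values_per_feature:
        total *= n
    return total


# Hypothetical domain: each feature takes 10 distinct values.
for num_features in (2, 4, 8):
    print(num_features, "features ->", num_states([10] * num_features), "states")
```

Each added feature multiplies the count by the number of values it can take, which is the exponential growth described above.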

Fortunately, it has been found that in most domains much of the
information that describes the state is redundant or irrelevant
with regard to solving the problem. State abstraction methods have
been developed to simplify state spaces by making generalizations
that remove redundancies and hide unnecessary information (Sutton
1996; Li, Walsh, and Littman 2006; Wright and Gemelli 2009). In
this paper we focus on one particular state abstraction approach
known as tile coding (Sutton and Barto 1998).

Tile coding is a form of state abstraction for domains with
continuous state spaces. It discretizes the state space into tiles
that cover ranges of values for each feature in the state space.
Every state that falls under a specific tile is treated as the same
abstract state, and the RL algorithm learns over the abstract state
space. The effectiveness of tile coding methods depends heavily on
the design of the tiling scheme. If there is insufficient
resolution in a particular area of the state space, the RL
algorithm will not be able to find π*. As a result, the design and
implementation of tile coding schemes has been a manual and
time-consuming process that requires significant domain expertise
to be effective.
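As an illustration of the discretization step, here is a minimal sketch of a single uniform tiling that maps a continuous state to the index of the tile containing it. The function name, the feature ranges, and the 4 x 4 grid are assumptions chosen for the example, not settings used in this paper.

```python
def tile_index(state, lows, highs, tiles_per_feature):
    """Map a continuous state to the index of the uniform tile containing it."""
    index = 0
    for x, lo, hi, n in zip(state, lows, highs, tiles_per_feature):
        # Position of x within this feature's range, scaled to [0, n].
        scaled = (x - lo) / (hi - lo) * n
        # Clamp so that out-of-range values fall into the nearest edge tile.
        bin_number = min(max(int(scaled), 0), n - 1)
        index = index * n + bin_number
    return index


# Assumed two-feature domain (position, velocity) with a 4 x 4 tiling.
lows, highs, grid = (-1.2, -0.07), (0.5, 0.07), (4, 4)
a = tile_index((-0.50, 0.00), lows, highs, grid)
b = tile_index((-0.45, 0.01), lows, highs, grid)
print(a == b)  # nearby states share one abstract state
```

Every state that maps to the same index is treated as one abstract state, so a value learned for one of them is shared by all of them.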

Recently, there has been work on automated tiling methods that
attempt to derive an effective tiling scheme on-line (Uther and
Veloso 1998; Whiteson, Taylor, and Stone 2007). In this paper we
introduce a new automated tile coding algorithm called Evolutionary
Tile Coding (EvoTC). EvoTC uses a genetic algorithm to derive
efficient tile structures that maximize an RL algorithm’s ability
to find a good policy. We compare the performance of EvoTC to
competing fixed and automated tile coding methods, CMAC and
Adaptive Tile Coding, and we show that EvoTC is able to provide
more efficient tile-based state abstractions that should help RL
algorithms scale toward more complex problems.

The rest of the paper proceeds as follows. In the next section we
provide background and details on the two tile coding approaches we
use for comparison. We then introduce and describe EvoTC in detail.
This is followed by a description of our experimental setup and
results. We conclude with a discussion of the results and a summary
of the conclusions we were able to draw.

Background 

Cerebellar  Model  Articulation  Controller 

The Cerebellar Model Articulation Controller algorithm, better
known as CMAC, was introduced in (W.T., F.H., and L.G. 1990) (then
called the Cerebellar Model Arithmetic Computer) as a means of
providing local generalization of the state space based on how the
human brain is thought to respond to stimuli (Albus 1971). This
behavior allows states that are close to an observed state to be
updated even though they have not been observed themselves. We
chose it for our analysis because it is arguably the most popular
of the tile coding methods (Sutton 1996).

CMAC partitions state spaces into a fixed set of non-overlapping
tiles. Q-values that are learned from any one state in a tile are
learned for all states in the tile. Partitioning the state space
into many small tiles will slow learning but will improve the
probability of finding optimal policies. Conversely, if the tiles
are very large then Q-values will be distributed quickly across
many states, but there is no guarantee that two states on opposite
sides of a tile should share the same action values. In this case,
each state may favor a different action, but only one action can be
preferred per tile, preventing a correct policy from being found.
This tradeoff is mitigated by overlapping several layers of tiles
to provide both coarse and fine-grained generalization. Each
observed state updates one tile per layer, and each of these tiles
covers a different portion of the state space. The preferred action
for a state is the action that maximizes the weighted sum of action
values across all tiles that contain that state.
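The layered scheme can be sketched as follows. This is a minimal illustration under stated assumptions rather than the CMAC implementation used in our experiments: the class name, the uniform offset between layers, the equal weighting of layers, and the way the update is shared among layers are all choices made for the example.

```python
from collections import defaultdict


class LayeredTileCoder:
    """Several offset uniform tilings; a value is the sum of one weight per layer."""

    def __init__(self, lows, highs, tiles_per_feature, num_layers):
        self.lows, self.highs = lows, highs
        self.n = tiles_per_feature
        self.num_layers = num_layers
        self.weights = defaultdict(float)  # (layer, tile, action) -> weight

    def active_tiles(self, state):
        """Return the single tile that contains `state` in each layer."""
        tiles = []
        for layer in range(self.num_layers):
            # Assumed offset scheme: shift each layer by a fraction of a tile.
            shift = layer / self.num_layers
            coords = []
            for x, lo, hi in zip(state, self.lows, self.highs):
                scaled = (x - lo) / (hi - lo) * self.n + shift
                # Clamp so each layer keeps exactly n tiles per feature.
                coords.append(min(max(int(scaled), 0), self.n - 1))
            tiles.append((layer, tuple(coords)))
        return tiles

    def q_value(self, state, action):
        return sum(self.weights[(layer, tile, action)]
                   for layer, tile in self.active_tiles(state))

    def update(self, state, action, target, alpha):
        # Share the correction equally among the active tiles, one per layer.
        error = target - self.q_value(state, action)
        for layer, tile in self.active_tiles(state):
            self.weights[(layer, tile, action)] += alpha * error / self.num_layers

    def greedy_action(self, state, actions):
        return max(actions, key=lambda a: self.q_value(state, a))


coder = LayeredTileCoder((-1.2, -0.07), (0.5, 0.07), tiles_per_feature=11, num_layers=2)
coder.update((-0.5, 0.0), action=0, target=1.0, alpha=1.0)
print(coder.q_value((-0.42, 0.0), 0))  # a neighbour sharing one of the two tiles
```

A state that shares only some layers with the updated state receives only part of the update, which is how the overlap yields both coarse and fine generalization.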

The CMAC algorithm has effectively learned a number of domains,
including the mountain car and single pole balance (Sutton 1996).
More recently, it has been shown to suffer from some limitations on
slightly more complicated problems like the double pole balance
(Gomez, Schmidhuber, and Miikkulainen 2006). The main difficulty in
applying CMAC is choosing a suitable way to break up the state
space into tiles. If this is done inexpertly, then states that do
not prefer the same action can be forced to learn together when
they are both confined to a single tile. This will severely slow
down, if not prevent, the learning of a successful policy. A
secondary concern is the memory requirement for high-dimensional
scenarios. The number of tiles per mapping scales exponentially in
the number of input perceptions for a problem, and storing all
visited tiles can quickly become unreasonable. Hashing techniques
like those mentioned in (W.T., F.H., and L.G. 1990) can be used to
place limits on the memory requirements of CMAC, but in the event
of a hash collision they effectively cause non-local generalization
of learned values, which can negatively impact policy convergence.

Adaptive  Tile  Coding 

Adaptive tile coding (ATC) (Whiteson, Taylor, and Stone 2007) is a
tile coding algorithm that automatically derives a
variable-resolution state abstraction while learning a policy for a
specific problem. It is similar to the continuous U-Tree algorithm
discussed in (Uther and Veloso 1998). Both methods derive
abstractions by starting with a single tile that encompasses the
entire state space. Based on observations made while an RL
algorithm attempts to learn over the abstract state space, “splits”
are introduced. Splits divide individual tiles evenly along feature
dimensions into two new abstract states. The idea is to increase
the resolution only in areas of the state space where changes in
action choices should be made. Splitting continues until the RL
algorithm is able to solve the problem using the derived abstract
state space. Determining when and where to split tiles is the only
significant difference between these methods. Heuristics are used
for ATC (Whiteson, Taylor, and Stone 2007) and a statistical method
is used for continuous U-Tree (Uther and Veloso 1998).

ATC uses two heuristics to determine first when to split and then
where to split. The first heuristic keeps track of the lowest
Bellman error per time step. If the lowest Bellman error fails to
change for a specified number of consecutive updates, the split
threshold, then the heuristic concludes that learning has stopped
and that it is appropriate to split a tile. Once it has been
determined that a split is appropriate, the policy criterion
heuristic determines where to split. The ATC algorithm updates the
Q-values for all potential tiles in the tilings. Every time a
potential tile within the currently activated tile prescribes a
different action from the activated tile, a counter for that
potential tile is incremented. ATC splits the tile containing the
potential tile with the highest counter value, which establishes
that potential tile in the tiling. This process increases the
resolution of the tiling in areas where changes in policy are
likely.
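The two heuristics can be sketched as follows. This is a simplified reading of the description above, not the implementation from (Whiteson, Taylor, and Stone 2007): the class and function names are hypothetical, and the decision to reset the tracked error after a split is an assumption.

```python
class SplitTrigger:
    """Signals a split once the lowest Bellman error has stopped improving."""

    def __init__(self, split_threshold):
        self.split_threshold = split_threshold
        self.lowest_error = float("inf")
        self.updates_without_change = 0

    def observe(self, bellman_error):
        """Record one update; return True when it is time to split a tile."""
        error = abs(bellman_error)
        if error < self.lowest_error:
            self.lowest_error = error
            self.updates_without_change = 0
        else:
            self.updates_without_change += 1
        if self.updates_without_change >= self.split_threshold:
            # Assumed behaviour: measure progress afresh on the refined tiling.
            self.lowest_error = float("inf")
            self.updates_without_change = 0
            return True
        return False


def choose_tile_to_split(disagreement_counts):
    """Pick the potential tile whose action most often differed from its parent's."""
    return max(disagreement_counts, key=disagreement_counts.get)


trigger = SplitTrigger(split_threshold=3)
print([trigger.observe(e) for e in (1.0, 0.5, 0.6, 0.7, 0.8)])
print(choose_tile_to_split({"left half": 2, "lower half": 7, "right half": 1}))
```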

In (Whiteson, Taylor, and Stone 2007) it was shown that ATC has a
number of advantages over CMAC. First, the tilings are derived
automatically, eliminating the need to manually design and discover
an effective tiling. Second, ATC was found to be faster at finding
π* than CMAC using the best found parameters for the number of
tiles and tilings. The reason for the improvement is that the RL
algorithm benefits from the generalization of the overly abstract
state space early in learning. As the abstract state space becomes
more specific, the new states are already partially learned because
they retain the values learned by the more general state they were
split from.

Although this approach is an improvement over fixed tile coding
methods like CMAC, it suffers from a significant drawback: it
always splits tiles evenly in half. It is highly unlikely that such
a split will be positioned exactly at a decision point, where one
action should be preferred over another. These methods can
compensate by successively splitting sub-tiles until the decision
point is reached. However, doing so introduces many unnecessary
states, which slows down the RL algorithm.
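A small worked example shows the cost of even splits. The sketch below counts how many successive halvings are needed before a tile boundary lands within a tolerance of a decision point on a feature normalized to [0, 1]. The function name, the normalization, and the tolerance of 0.01 are assumptions made for illustration; the target of 0.477 is the split position reported for the mountain car in our results.

```python
def halvings_to_isolate(target, lo=0.0, hi=1.0, tolerance=0.01):
    """Count even splits until a tile boundary lies within `tolerance` of `target`."""
    splits = 0
    while True:
        mid = (lo + hi) / 2.0
        splits += 1
        if abs(mid - target) <= tolerance:
            return splits
        # Keep refining the half that still contains the decision point.
        if target < mid:
            hi = mid
        else:
            lo = mid


print(halvings_to_isolate(0.477))
```

Under these assumptions a single well-placed uneven split does the work of several even ones, and each extra even split adds an abstract state that the RL algorithm must learn.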

Evolutionary  Tile  Coding 

Evolutionary Tile Coding (EvoTC) is a new approach that takes
flexible state space arrangement even further. Like the other
adaptive tile coding approaches, it starts with a single tile that
encompasses the entire state space and introduces splits to
increase the detail of the abstraction. The major differences are
that EvoTC uses an evolutionary algorithm (Holland 1992) to
determine when and where to place the splits, and that the splits
can divide tiles unevenly. By dividing tiles unevenly, EvoTC should
be able to derive more efficient and effective tiling abstractions
than other existing tile coding approaches.

In ATC and continuous U-Tree the splits are placed in the center of
tiles because it is difficult to determine exactly where the
optimum split should be made. Instead, these methods home in on the
correct position by adding further splits. The additional splits
are unnecessary and slow learning. EvoTC is able to find better
split positions by framing the search for the optimal position and
number of splits as an optimization problem in which the
performance of the RL algorithm is the objective.

EvoTC starts with an initial population of tilings to be evaluated.
Each tiling is evaluated independently by pairing it with an RL
algorithm that attempts to solve a problem using the tiling as a
state abstraction device. The performance of the RL algorithm is
taken as the fitness of the tiling. Tilings that are more effective
at abstracting the state space should enable the RL algorithm to
perform better. After all members of the population are evaluated,
the fittest tilings are kept for successive generations. New
tilings based on the fittest members of the previous generation are
also introduced into the population for the next generation. The
new tilings are generated by applying mutation operators, described
later, to the current fittest members of the population. The new
population is then evaluated in the same manner as the previous
generation. Over the course of generations a tiling should be
produced that enables the RL algorithm to exceed a specified
performance threshold, at which point the algorithm terminates.
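The generational loop described above can be sketched as follows. This is a minimal outline under stated assumptions, not our implementation: the function signature, the number of survivors, and the generation cap are hypothetical, and the toy usage replaces the RL evaluation with a stand-in fitness so that the example runs on its own.

```python
import random


def evolve(initial_tiling, evaluate, mutate, population_size, num_survivors,
           fitness_threshold, max_generations, rng):
    """Evolve tilings until one lets the learner reach `fitness_threshold`."""
    # Assumes `mutate` returns a new tiling and leaves its argument unchanged.
    population = [initial_tiling] * population_size
    best = initial_tiling
    for _ in range(max_generations):
        # Fitness of a tiling is the performance of an RL learner that uses it.
        scored = sorted(((evaluate(t), t) for t in population),
                        key=lambda pair: pair[0], reverse=True)
        best_fitness, best = scored[0]
        if best_fitness >= fitness_threshold:
            return best
        # Keep the fittest tilings and refill the population with mutants.
        survivors = [t for _, t in scored[:num_survivors]]
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(population_size - num_survivors)]
        population = survivors + children
    return best


# Toy stand-in: a "tiling" is one split position, and fitness is highest
# when the split sits at a hypothetical decision point of 0.477.
best = evolve(
    initial_tiling=0.5,
    evaluate=lambda split: -abs(split - 0.477),
    mutate=lambda split, rng: split + rng.gauss(0.0, 0.05),
    population_size=20, num_survivors=5,
    fitness_threshold=-0.01, max_generations=200,
    rng=random.Random(0),
)
print(round(best, 3))
```

In EvoTC itself the `evaluate` step is where an RL learner would be run with the candidate tiling, which is why evaluation dominates the cost of each generation.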

In the following we provide details on how the tilings are
represented in the evolutionary algorithm and how the mutation
operators function.

Genetic Representation of Tiles. Each chromosome in EvoTC
represents a single unique tiling. The chromosome holds a tile
arrangement described as a binary decision tree. The genes that
make up the chromosome describe the nodes in the tree. Leaf nodes
represent current tiles and hold the Q-values associated with the
corresponding abstract state. Non-leaf nodes represent tile
divisions and describe the feature along which the division is made
and its position. See Figure 1 for an illustration of how the
tiling is represented as a tree. The genetic representation is not
of fixed length, which enables the tiling to become more complex as
needed. How the chromosome is extended is described under the
divide mutation operator below.

Figure 1: This figure illustrates how EvoTC represents the tile
discretizations of a state space as a binary decision tree. Left:
a sample discretization of the two-dimensional state space of the
mountain car problem. Right: the corresponding decision tree, which
is used to look up the Q-values associated with the individual
tiles.
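The tree lookup can be sketched in a few lines. The class layout and field names below are assumptions made for illustration; only the idea that non-leaf nodes store a feature and a split position while leaves store Q-values comes from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A gene: either a division (feature, position, children) or a leaf tile."""
    feature: Optional[int] = None       # index of the feature that is divided
    position: Optional[float] = None    # where along that feature the divide sits
    below: Optional["Node"] = None      # child for values < position
    above: Optional["Node"] = None      # child for values >= position
    q_values: List[float] = field(default_factory=list)  # used by leaves only

    def is_leaf(self):
        return self.feature is None


def find_tile(root, state):
    """Walk from the root to the leaf tile that contains `state`."""
    node = root
    while not node.is_leaf():
        node = node.below if state[node.feature] < node.position else node.above
    return node


# Hypothetical tiling of a (position, velocity) space with one divide on velocity.
slow = Node(q_values=[0.0, -1.0, -2.0])
fast = Node(q_values=[-2.0, -1.0, 0.0])
root = Node(feature=1, position=0.0, below=slow, above=fast)
print(find_tile(root, (-0.5, -0.01)).q_values)
```

Because a lookup only follows one branch per division, its cost grows with the depth of the tree rather than with the total number of tiles.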

Mutation Operators. The key to EvoTC is its mutation operators,
which produce diverse tile arrangements in the search for the
optimal arrangement. These mutation operators are applied, with a
specified probability, to existing chromosomes in the population to
make new chromosomes at the end of each generation. Two mutation
operators are used in this algorithm:

• The shift operator moves the position of existing tile splits.
Its purpose is to explore how well the existing tiling
arrangement can abstract the state space. As such, this operator
should be activated with a higher probability than the divide
operator, which changes the structure of the tiling arrangement.

When this mutation operator is activated it selects a number of
division nodes at random to be modified. For each selected node,
the position of the divide is shifted by a small amount drawn
from a Gaussian distribution, and the shift is limited so that
the divide comes no closer than within 1% of the edge of the
tile. This limit prevents a pair of adjacent tiles from
effectively becoming one tile, which would happen if one of them
held 0% of the state space. After the selected genes are altered,
the tree is updated with the mutated genes.

• The divide operator introduces new splits into the tiling to add
granularity to the abstract state space. It should have a
relatively low probability of being activated in order to give
the shift operator sufficient time to explore more general
tilings.

This operator functions by selecting a single leaf node at random
to divide. The node is divided by randomly selecting a dimension
to divide along, and the division is placed using a Gaussian
distribution centered on the middle of the tile. Once the divide
is set, new leaf nodes and genes are created and attached to the
new divide node. Finally, the Q-values for the new leaf nodes are
initialized to a value that encourages exploration of the new
tiles.
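The two operators can be sketched as follows. This is an illustrative outline under several assumptions that do not come from the description above: divide positions are stored as fractions of the parent tile’s extent (so that shifting one divide never invalidates the divides beneath it), the standard deviations are arbitrary, and an initial Q-value of 0.0 is taken to be optimistic for a task with negative rewards.

```python
import random

MARGIN = 0.01  # keep every divide at least 1% of the tile away from its edges


class Node:
    """Leaf tile (children is None) or a divide stored as a fraction of its tile."""

    def __init__(self, num_actions, initial_q):
        self.feature = None
        self.fraction = None
        self.children = None
        self.q_values = [initial_q] * num_actions


def collect(node, want_leaves):
    """Return all leaves (want_leaves=True) or all divide nodes (False)."""
    if node.children is None:
        return [node] if want_leaves else []
    found = [] if want_leaves else [node]
    for child in node.children:
        found += collect(child, want_leaves)
    return found


def clamp(fraction):
    return min(max(fraction, MARGIN), 1.0 - MARGIN)


def shift(root, rng, sigma=0.05):
    """Nudge the position of a random subset of the existing divides."""
    divides = collect(root, want_leaves=False)
    if not divides:
        return
    for node in rng.sample(divides, rng.randint(1, len(divides))):
        node.fraction = clamp(node.fraction + rng.gauss(0.0, sigma))


def divide(root, rng, num_features, initial_q, sigma=0.1):
    """Split one random leaf along a random feature, near the tile's center."""
    leaf = rng.choice(collect(root, want_leaves=True))
    num_actions = len(leaf.q_values)
    leaf.feature = rng.randrange(num_features)
    leaf.fraction = clamp(rng.gauss(0.5, sigma))
    leaf.children = (Node(num_actions, initial_q), Node(num_actions, initial_q))
    leaf.q_values = []  # the Q-values now live on the two new leaves


rng = random.Random(0)
root = Node(num_actions=3, initial_q=0.0)
divide(root, rng, num_features=2, initial_q=0.0)
shift(root, rng)
print(len(collect(root, want_leaves=True)), "tiles after one divide")
```

Shift leaves the number of tiles unchanged and only moves boundaries, whereas each divide adds exactly one tile, which is why the divide probability controls how quickly the abstraction grows.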
Experiments

We conducted an empirical comparison of CMAC, ATC, and EvoTC on two
well-known RL benchmark problems with continuous state spaces. The
purpose of these algorithms is to reduce the size and complexity of
a domain’s state space and so enable an RL algorithm to discover an
optimal policy for problems in that domain. We measure the
effectiveness of the approaches by the number of states in the
abstract state space and by the number of learning updates required
by the RL algorithm to learn an optimal policy. The fewer states in
the abstract state space, the more effectively the method abstracts
the state space. Likewise, the fewer updates the RL algorithm
requires to learn an optimal policy, the better the state
abstraction.

The following is a description of the benchmark problems used and
our experimental setup. It should be noted that all the methods
require some parameter tuning in order to achieve their best
performance. In our comparisons we used the best found parameter
settings for each method. The parameters used for each method and
problem are specified below. Each method was paired with the RL
algorithm SARSA (Sutton and Barto 1998) to derive policies. The
results shown for EvoTC are the median values of 25 separate runs.
Because EvoTC depends on a stochastic search, several runs with
different random seeds were necessary to properly characterize its
performance.
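For reference, a single tabular SARSA update over abstract states can be sketched as below. This is the textbook form of the update, not a transcription of our code; the function names, the dictionary representation of Q, and the parameter values in the usage lines are assumptions.

```python
from collections import defaultdict
import random


def epsilon_greedy(q, state, actions, epsilon, rng):
    """Explore with probability epsilon, otherwise take the greedy action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


def sarsa_update(q, s, a, reward, s_next, a_next, alpha, gamma, terminal=False):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    target = reward if terminal else reward + gamma * q[(s_next, a_next)]
    td_error = target - q[(s, a)]
    q[(s, a)] += alpha * td_error
    return td_error


# Abstract states are tile identifiers; here they are just the integers 0 and 1.
q = defaultdict(float)
q[(1, 0)] = 2.0
error = sarsa_update(q, s=0, a=1, reward=-1.0, s_next=1, a_next=0,
                     alpha=0.5, gamma=1.0)
print(error, q[(0, 1)])
```

The returned error is the quantity whose magnitude an adaptive method such as ATC can monitor to decide whether learning has stalled.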

Mountain  Car 

The mountain car problem is a classical RL control problem in which
the learner has to derive a policy that enables an automobile to
escape a deep valley. The car does not have enough power to drive
up the sides of the valley from a standing start. To get out, the
driver must build up enough momentum by rocking back and forth. Two
continuous features, position and velocity, specify the state. At
each time step the RL algorithm has to select one of three possible
actions: accelerate to the left, accelerate to the right, or coast.
A reward of -1 is given for every time step in which the car has
not reached the goal state, which encourages the discovery of a
policy that reaches the goal state in as few time steps as
possible.

We use a problem set of 100 different starting positions and
initial velocities to represent the problem domain in our
experiments. The algorithms are evaluated based on their average
performance over all instances in the problem set. For our problem
set an optimal policy enables the car to escape the valley in an
average of 50 time steps.

In our experiments for CMAC we used 2 layers of tilings with 11
tiles per feature in each layer. This allows a maximum of 242
unique abstract states. ATC requires the split threshold parameter
to be specified. For the mountain car problem we found a value of
521 to work well. EvoTC requires the mutation probabilities to be
specified. For this problem we used values of 32% for shift and 5%
for divide per tiling per generation, with a population size of 100
in each generation.

Pole  Balance 

The pole balance problem models a car balancing a long pole
attached by a hinge (Barto, Sutton, and Anderson 1990). The car is
free to travel along a short track to keep the pole balanced
vertically over the car. Failure occurs if the pole falls more than
12 degrees from vertical or if the car rolls off either end of the
track. The state is represented by four continuous features: the
position and velocity of the car, and the angle and angular
velocity of the pole. There are three available actions: accelerate
to the left, accelerate to the right, or coast.

Table 1: Results for Mountain Car

Method    Number of Updates    Number of States
CMAC      1.22e+05             177
ATC       1.88e+05             83
EvoTC     2.00e+07             2

We use a problem set of 20 different initial feature values in our
experiments. The goal is for the algorithms to find a policy that
keeps the pole balanced for at least 10^6 updates without dropping
the pole or exceeding the bounds of the track.

For CMAC, we again used 2 layers of tilings with 11 tiles per input
dimension in each layer, which gives a maximum of 29282 states. The
settings selected for EvoTC are 30% for shift and 12% for divide.
We were unable to successfully apply ATC to this problem.
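The maximum state counts quoted for CMAC on both benchmark problems follow directly from the number of layers, tiles per feature, and features. The short check below reproduces them; the function name is ours and is introduced only for this illustration.

```python
def max_cmac_states(num_layers, tiles_per_feature, num_features):
    """Upper bound on distinct tiles: one grid of tiles per layer."""
    return num_layers * tiles_per_feature ** num_features


print(max_cmac_states(2, 11, 2))  # mountain car: position and velocity
print(max_cmac_states(2, 11, 4))  # pole balance: four continuous features
```

Doubling the number of features from two to four raises the bound by a factor of 121, which illustrates why fixed tilings become expensive as features are added.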

Results  and  Discussion 

The results for the mountain car problem are listed in Table 1. All
three methods were able to converge to an optimal policy. CMAC
solved the mountain car problem in the fewest updates. This is
slightly surprising because ATC was shown to outperform CMAC on
this problem in (Whiteson, Taylor, and Stone 2007), and we were not
able to reproduce that result. However, our result is intuitive in
that the fixed CMAC tile coding was tuned for this problem over
many trial runs, whereas ATC and EvoTC have to learn their tile
abstractions, which requires additional time and updates.

It should be noted that EvoTC is penalized by the update metric
because all the updates required by the failed members of the
population are included. Including the aggregate updates of all
members of the population is necessary to get an accurate measure
of the computation time required. However, the evaluations of the
tilings within a generation are independent and could be run in
parallel, which would result in a significant speedup of the
algorithm.

Table 1 also shows the size of the abstract state space required by
each method. CMAC uses only 177 of the 242 potential states
available. EvoTC and ATC are able to solve the mountain car problem
using substantially smaller state spaces, which shows that they
derive much more efficient state abstractions. This suggests that
they will be able to scale more effectively as the size of the
state space increases.

The most striking result of this experiment is that EvoTC was able
to derive an optimal policy using an abstract state space
consisting of only 2 states. EvoTC was consistently able to find
this state abstraction during our experimentation.


Figure 2: How the EvoTC algorithm discretized the mountain car
state space.

Figure 3: How the ATC algorithm discretized the mountain car state
space.


The mountain car problem is one of the classic RL control problems.
It is considered difficult because of its continuous state space.
EvoTC reduced it to a two-state problem for which it is trivial for
an RL algorithm to find a policy. Not only that, EvoTC was able to
eliminate the need for an entire feature. The only split in the
state space occurs at .477 of the velocity feature. There are no
divisions over the position feature, which means that position is
not needed to solve the problem. This result highlights the power
of automated state abstraction to find unintuitive yet effective
abstractions.
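To illustrate what a two-state abstraction over velocity alone looks like, the sketch below pairs such an abstraction with a fixed policy and rolls it out. Several pieces are assumptions and are not taken from our experiments: the split is interpreted as a fraction of an assumed velocity range of [-0.07, 0.07], the action attached to each tile is chosen by hand, and the dynamics are the commonly used mountain car equations from (Sutton and Barto 1998), which may differ from the variant used here.

```python
import math

VEL_MIN, VEL_MAX = -0.07, 0.07   # assumed velocity range
SPLIT_FRACTION = 0.477           # split position reported in the text
SPLIT_VELOCITY = VEL_MIN + SPLIT_FRACTION * (VEL_MAX - VEL_MIN)

LEFT, COAST, RIGHT = 0, 1, 2
POLICY = {0: LEFT, 1: RIGHT}     # assumed action for each of the two tiles


def abstract_state(position, velocity):
    """Two abstract states; the position feature is ignored entirely."""
    return 0 if velocity < SPLIT_VELOCITY else 1


def step(position, velocity, action):
    """One step of the commonly used mountain car dynamics (an assumption)."""
    velocity += 0.001 * (action - 1) - 0.0025 * math.cos(3.0 * position)
    velocity = min(max(velocity, VEL_MIN), VEL_MAX)
    position += velocity
    if position < -1.2:          # inelastic wall on the left
        position, velocity = -1.2, 0.0
    return position, velocity


def steps_to_goal(position, velocity, max_steps=1000):
    """Follow the two-state policy; return the step count, or None on timeout."""
    for t in range(max_steps):
        if position >= 0.5:
            return t
        action = POLICY[abstract_state(position, velocity)]
        position, velocity = step(position, velocity, action)
    return None


print("split velocity:", round(SPLIT_VELOCITY, 5))
print("escaped:", steps_to_goal(-0.5, 0.0) is not None)
```

Under these assumptions the policy amounts to accelerating in roughly the direction the car is already moving, which builds momentum without ever consulting the position feature. The sketch does not show that this hand-chosen policy is optimal.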

This experiment also shows how important the design of the
abstraction can be. Figures 2 and 3 show the abstract state spaces
derived by EvoTC and ATC, respectively. The abstraction derived by
ATC is significantly more complex than the one derived by EvoTC and
includes divisions across the position feature. ATC cannot find the
abstraction that EvoTC finds because it always divides each tile
evenly. As a result, it had to derive a much more complex abstract
state space to learn an equivalent policy.


Table 2: Results for Pole Balance

Method    Number of Updates     Number of States
CMAC      3.69e+08              5379
ATC       failed to converge    failed to converge
EvoTC     1.04e+09              61

Table 2 shows the results we obtained by applying these methods to
the pole balance problem. The pole balance problem is significantly
more difficult than the mountain car problem in that it has double
the number of continuous features. CMAC still requires the fewest
updates, but it required significantly more abstract states to
solve this problem. In our experiments we were unable to find a
parameter setting that enabled ATC to converge. Once again EvoTC
was able to derive an abstraction with far fewer states that still
allows the RL algorithm to find an optimal policy. EvoTC still
required more updates than CMAC, roughly three times as many
(1.04e+09 versus 3.69e+08). However, this gap is much smaller than
on the mountain car problem, and the growth in the number of
updates and states required by CMAC relative to EvoTC further
suggests that EvoTC will scale more effectively as the size of the
state space is increased.

Observations  and  Discussion 

In our testing we found that all methods were extremely sensitive
to their parameter settings. Slight changes to parameter settings
that work for a domain could very easily prevent these methods from
converging. This was especially true of the ATC algorithm, which
required a substantial amount of trial and error to find a
parameter setting that worked for the mountain car problem. Finding
settings for CMAC and EvoTC was significantly less time-consuming,
but still required some trial and error.

Although EvoTC and CMAC were able to solve both benchmark problems,
it does not appear that either method will scale adequately as the
number of features that describe the state space is increased. Both
methods are based on tile coding and are linear abstractions of the
state space. As a result, although the abstract state spaces found
by these methods are significantly smaller than the actual state
space, they will still scale proportionally as the number of
features is increased. It may be that non-linear state abstraction
methods such as RL-SANE (Wright and Gemelli 2009) are necessary as
the number of features is increased.

Conclusion 

Real-world applications have large continuous state spaces that
hinder the use of RL algorithms. State abstraction methods such as
tile coding are necessary in order to apply RL to non-trivial
problems. Fixed tile coding algorithms such as CMAC can be
effective as long as the tiling scheme is properly designed.
Adaptive tile coding methods like ATC and EvoTC are appealing
because they do not require manual design of the state abstraction.
In this paper we introduced EvoTC and showed that it is able to
abstract the state space more effectively than CMAC and ATC on two
continuous state space problems.

Not only did EvoTC outperform CMAC and ATC in terms of abstraction
power, it was also able to reduce the classical mountain car domain
to a problem consisting of just two states. This result highlights
the power and importance of automated state abstraction methods.

Although EvoTC was able to abstract the state space of the mountain
car problem very effectively, it does not appear that the approach
will scale well as the number of features that describe the domain
is increased. We believe this is due to the linear nature of the
tiling abstraction. Although the tilings are gross abstractions of
the state space, the dimensionality of the abstract state space is
the same as that of the original state space. In future work we
will explore the derivation of non-linear state abstraction devices
such as multi-layered feed-forward neural networks (Wright and
Gemelli 2009) and examine how they scale. Non-linear state
abstractions may be able to find more efficient abstractions of
multi-dimensional state spaces, enabling them to scale more
effectively as the number of features is increased.